The Taxonomy Dictionary: a resource for correct spelling of taxa

This article describes ‘The Taxonomy Dictionary’, a resource that can enhance the spelling engine of a text editor such as Word, so that it can correctly spell every taxon described and listed in the largest taxonomy databases. It contains around 1.4 million unique words. Once it is installed, the spelling engine marks an incorrectly spelled taxon and suggests possible correct spellings. Installation instructions for Firefox, LibreOffice and Microsoft Word can be found on the GitHub repository. The software is licensed under the GPL3 licence.


INTRODUCTION
This article presents a newly created resource, a digital dictionary capable of enhancing the spelling engine of a text editor so that it can spellcheck every taxonomic name in a document. Most people working in biology frequently encounter words or technical terms that have to be manually added to the local user dictionary. This poses the risk of adding a mistyped word and being misled about its spelling from then on, until the error is noticed and corrected. Another issue is the sheer number of words that have to be added manually. Any automation of this process has the potential to save many hours of work and reduce spelling errors. The present resource was created to fill that gap, and should be of value to anyone working in an area of biology, especially where taxa are abundant and diverse.

CONSTRUCTION OF THE DICTIONARY
The Taxonomy Dictionary consists of a wordlist of around 1.4 million unique words. The wordlist has been generated by aggregating some of the largest taxonomy databases. The major ones are the GBIF backbone [6] and the Catalogue of Life (COL) [7]. Neither of these is complete, but they both aspire to contain every taxon described worldwide. To ensure good coverage of microbes, taxonomy databases for viruses, bacteria and fungi have been added, namely the International Committee on Taxonomy of Viruses (ICTV) [8], the List of Prokaryotic names with Standing in Nomenclature (LPSN) [9] and MycoBank [10]. For animals and plants, ZooBank [11] and World Flora Online (WFO) [12] are used. Table 1 shows the version of each database.
The data have been processed so that all metadata are filtered out and only the taxon names from each database are kept. The commands are linked together in a Bash [13] shell script available on the GitHub repository (wordlist_script.sh). The databases are downloaded with curl [14] or wget [15]. The LPSN database is downloaded through its API; this is handled by a Python [16] script (lpsnapi.py) that is called from the Bash script. Please be aware that the credentials in the Python script need to be updated before the script can run (registration on the LPSN site is free). The zip-compressed databases are unzipped. Some databases are downloaded as Excel files, which are exported to comma-separated values (CSV) files by another Python script (csvexport.py). Then, cut [17] is used to extract the relevant columns containing information on taxa. For a dictionary to be readable by most text editors, it must be presented as a long list of words, with each word on a new line. Therefore, sed [18] is used to substitute tabs, commas and spaces with new lines, and the output is redirected to a separate text file for each database.
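The column extraction and word splitting described above can be illustrated with a short Python sketch. This is not the author's script, which uses cut and sed; the function name extract_words, the column indices and the sample data are hypothetical and serve only to show the idea.

```python
import csv
import io
import re


def extract_words(csv_text, columns, skip_header=True, delimiter=","):
    """Keep only the chosen columns and split their cells into single words.

    Roughly mirrors `cut` (column selection) followed by `sed`
    (tabs, commas and spaces replaced by new lines).
    """
    reader = csv.reader(io.StringIO(csv_text, newline=""), delimiter=delimiter)
    if skip_header:
        next(reader, None)  # drop the header row, if any
    words = []
    for row in reader:
        for index in columns:
            if index < len(row):
                # Split on runs of tabs, commas and spaces; drop empty pieces.
                words.extend(w for w in re.split(r"[\t, ]+", row[index]) if w)
    return words


# Hypothetical sample standing in for one exported database.
sample = (
    "id,genus,species\n"
    "1,Escherichia,Escherichia coli\n"
    "2,Bacillus,Bacillus subtilis\n"
)
words = extract_words(sample, columns=[1, 2])
print("\n".join(words))  # one word per line, as a dictionary file expects
```

Duplicates are still present at this stage; they are removed only when the per-database lists are merged.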
The text files are concatenated together with cat [19] and are then piped through sort [20] and duplicates are filtered out with uniq [21]. Finally, any words containing numbers or special characters are filtered out with sed, as they are not considered genuine entries. The result is a wordlist containing 1 400 952 unique words.
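The merge, sort, de-duplication and filtering steps can be sketched in the same way. This is a minimal illustration under stated assumptions rather than the published pipeline: the function name build_wordlist is hypothetical, and "special characters" is assumed here to mean anything that is not a letter, so hyphenated entries are dropped too.

```python
import re

# Assumption: a genuine entry consists of letters only (accented letters allowed).
LETTERS_ONLY = re.compile(r"[^\W\d_]+")


def build_wordlist(*word_sources):
    """Merge several word lists, drop duplicates and non-letter entries, and sort."""
    unique = set()
    for source in word_sources:
        unique.update(source)  # concatenation plus de-duplication (cat | sort | uniq)
    # Keep only words made entirely of letters, then sort them.
    return sorted(w for w in unique if LETTERS_ONLY.fullmatch(w))


# Hypothetical per-database word lists; the rejected entries are taken from
# the examples quoted in the review history.
source_a = ["coli", "Escherichia", "10-dentatus"]
source_b = ["Escherichia", "0507KN21", "Bacillus"]
wordlist = build_wordlist(source_a, source_b)
print(wordlist)
```

Python sorts by code point, so capitalised words come before lower-case ones; the order produced by the sort utility depends on the locale and may differ.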

FUNCTIONALITY
Installation instructions for Firefox, LibreOffice and Microsoft Word can be found on the GitHub repository. After installation, an incorrectly spelled taxon will be marked by the spelling engine and it will suggest possible correct spellings. However, it will not check if italics are used correctly, so users should be aware of that. In most text editors it is possible to enable it alongside the preferred language. The dictionary is huge, but on any modern computer there should be no performance issues. Due to the size of the dictionary, the autosuggestions may be subject to error; this depends on the spelling engine and can occur if the user tries to spell a word for which there are many similar ones in the dictionary. Most of the time, however, it will suggest a reasonable correction.
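Why suggestions can become ambiguous in a large wordlist can be illustrated with Python's difflib module. This is only a stand-in: real spelling engines use their own suggestion algorithms, and the function name suggest, the cutoff value and the tiny wordlists are assumptions made for this sketch.

```python
import difflib


def suggest(word, wordlist, n=3, cutoff=0.8):
    """Return up to n entries from wordlist that closely resemble word."""
    return difflib.get_close_matches(word, wordlist, n=n, cutoff=cutoff)


# A misspelling with a single near neighbour gives one clear suggestion.
genera = ["Escherichia", "Eschscholzia", "Esenbeckia", "Bacillus"]
clear = suggest("Eschericia", genera)

# A misspelling with several equally close neighbours is ambiguous:
# the engine cannot know which epithet the user intended.
epithets = ["alba", "album", "albus"]
ambiguous = suggest("albu", epithets)

print(clear)
print(ambiguous)
```

With 1.4 million entries, the second situation becomes more likely, which is consistent with the caveat above.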
If, against expectation, a user finds a taxon that is not included or has a suggestion for a relevant database to add, they are welcome to reach out, and I will try to incorporate it. I plan to update the dictionary 2-4 times a year, depending on whether there are any major releases of the databases upon which the dictionary is built.

Funding Information
No funding was received.

Acknowledgements
Thanks to Jose Alfredo Samaniego Castruita who showed me how to use sed. Thanks to everyone who has contributed to the databases used for this project, and especially to their maintainers.

Author contributions
K.B.: conceptualization and development of the software, and every step of writing this article, from the original draft to submission.

Conflicts of interest
The author of the article is also the author of The Taxonomy Dictionary. The author lists no further conflicts of interest.

Consent to publish
All the databases used for this work are licensed under a Creative Commons (CC) licence. All of them except MycoBank allow others to remix, adapt and build upon their work non-commercially. MycoBank is licensed under a CC BY-NC-ND licence. The author has obtained written permission to use its data in this work.

17. GNU P. Free Software Foundation. cut (GNU core utils); (n.d.)
18. GNU P. Free Software Foundation. GNU sed; (n.d.)
19. GNU P. Free Software Foundation. cat (GNU core utils); (n.d.)
20. GNU P. Free Software Foundation. sort (GNU core utils); (n.d.)
21. GNU P. Free Software Foundation. uniq (GNU core utils); (n.d.)

Comments: This study would be a valuable contribution to the existing literature. This is a study that would be of interest to the field and community. Thank you for your submission; we are pleased to accept the revised manuscript. Thank you for taking the effort to address the reviewers' comments, especially the additional work carried out with the Python scripts and general automation of the process. Congratulations, and we encourage submissions to ACMI in the future.

Author response to reviewers to Version 1
I would like to thank the editor and the reviewers for their comments. They have helped me develop a better product, and have motivated me to implement things I had planned for some time in the future (Firefox and LibreOffice extensions). I hope that you will find the dictionary and the resubmitted article better documented than before and a more mature product.

Reviewer 1:
This short manuscript describes a lexical application which can be loaded into packages such as Word to help ensure text contains correctly spelt taxonomic names of microbes. I cannot comment on the computational infrastructure used to create the tool. However, the tool itself may be of use to users in the microbiology community and so I am happy to recommend publication.
My only concern is that the tool as presented here seems to be a 'single shot' collation and filtering of names from the key databases. However, these databases expand by 1000s of names per annum. It would be interesting to know if there are plans to make this an iterative resource, i.e. will periodic updates from the databases be incorporated (in addition to the manual updates hinted at in line 78)?

Minor comments:
Lines 22 and 67: 1.412.046 might be clearer as "1.41 million" or "1,412,046" (as in line 55)
Line 34: "public available, links" should be "publicly available; links"
Line 49: "aspect are" should be "aspect is"
Line 50: "process have" should be "process has"
Line 56: "major" would read better than "biggest"
Lines 59-61: "and fungi -that being; International… MycoBank [6] have been added." would read better as "and fungi have been added i.e., International… MycoBank [6]."
Line 74: "autosuggestions are not always on spot." is a little vague. Perhaps "autosuggestions may be subject to error,"
Line 78: should read "and I will try to"

Thanks for the corrections. The manuscript has been updated accordingly.

Reviewer 2:
This manuscript describes a helper tool for scientists, in the form of a dictionary of recognised taxonomic terms that can be used to populate spellcheck software, and a script that was used to compile that dictionary.
As taxonomic terms are often arcane or cryptic, and not usually present in word processing software or other spellcheckers, such a dictionary could be of widespread convenience across a range of fields. Microbiology has a relative advantage in this over some other fields in having a set of agreed high-quality taxonomic resources that can be mined for such terms. The author has acquired multiple such directories of taxonomy, and claims to have compiled a wordlist, with one taxonomic term per line, the "digital dictionary" referred to in the manuscript.
The main contribution of the manuscript appears to be to advertise the existence of a text file that can be used as a dictionary file and incorporated into a user's own current spellchecker (presuming the dictionary format is accepted). The "tool" is therefore not a spellchecker in its own right, and I would suggest it might be better described as a "resource" rather than a "tool", as it is an input to a tool, and not an active piece of software.

I get your point. A dictionary in general might be better labeled as a resource. I have changed the term on the GitHub page and in the manuscript.

I found the compilation of the dictionary to be incompletely described in the manuscript (e.g. by reference to the script in the project repository). To be fair, the GitHub repository at https://github.com/kbagge/Taxonomy_dictionary/tree/v1.0 does contain a script that appears to have been used to generate the dictionary, and the Zenodo record for this is linked from the paper, but I would still expect to see an outline of the process used to convert the input directories into the final dictionary; this would be expected of a standard methodological description for a bioinformatics paper.

Thanks for pointing this out. I have added a detailed description of the steps carried out by the shell script in section "6. Construction of the dictionary".

The repository contains a shell script that provides instructions to the user explaining how the original taxonomy database files were obtained, but does not itself download them. This is a minimal level of reproducibility, but does not automate or make more user-friendly the process of acquiring the input data. For a resource like this I would expect a (relatively) easy-to-use automated tool to compile the list from the named sources. The impression I gained from the manuscript was that the process of (re)generating the dictionary would be automated but, as the GitHub repository notes: "The repository contains a script that was used to generate the dictionary. You can reproduce it yourself on your machine or get inspired and make your own dictionary for another topic. Please be aware that the script contains some manual steps that must be done before the rest can run. This was unavoidable since some of the databases needs to be downloaded manually others have to be exported from excel format." My view, as a bioinformatician, is that the manual steps are avoidable: downloads and Excel parsing can be automated and libraries exist in most common programming languages to make, for instance, automated interaction with Excel files possible.

Thanks for reminding me of this. I am still relatively new to programming, so sometimes it is hard to see the obvious. I made this project in Bash, and did not think about the possibility of using other languages. I have now made two Python scripts. One downloads the LPSN database through its API (the script on GitHub needs to be filled in with login credentials). The other handles the CSV export from Excel. These steps have also been described in section "6. Construction of the dictionary".

I would be sympathetic to overlooking the need for manual downloads if the word list was useful as it stood. However, the word list appears to contain non-taxonomic terms and so has not been compiled cleanly (see https://raw.githubusercontent.com/kbagge/Taxonomy_dictionary/v1.0/taxonomy.dic, commit 97a0350), e.g. these terms appear:
01-FULL-49-22b 01-FULL-54-110 02-12-FULL-59-9 02-FULL-45-10c 02-FULL-45-11b 02-FULL-45-17b 0507KN21 100268sal2 10-dentatus 10-fasciata 10-fasciatum 10-guttata 10-guttatus
and I do not think they are all valid, recognised taxonomic terms. My view is that these inclusions likely derive from a combination of relatively informal taxonomic directory formats, and inadequate testing/incorrect parsing in the script. As the dictionary resource itself doesn't provide the claimed information (i.e. it includes a number of non-taxonomic terms) I do not think it, or the script that generates it, is yet ready for sharing/publication.

This was not an error in the compilation; it was a design choice. I designed the script to filter out any words containing no letters and to keep any word that contained one or more letters, even if it also contained numbers or special characters.
I briefly considered filtering out words that contained numbers or special characters, but was afraid that this might filter out valid entries, especially since bacteriophages can sometimes have weird names. After your comment, and after having learned that spellcheckers often ignore words that are a mixture of numbers and letters, I think that the best practice is to filter out any words containing numbers or special characters. I have updated the script to do so, and described it in the manuscript. This of course reduces the number of words in the dictionary and explains why it shrinks from 1.41 to 1.40 million words even though the databases have been updated.

As for the validity of the terms, please see above. Regarding the final product and its readiness for publication: I have developed a Firefox addon and a LibreOffice extension that can be easily installed through their official webpages. For Microsoft Word, which is probably the most used editor, there is no easy way to install it with a click. The process explained on GitHub does work, even though it might be a bit cumbersome. I have expanded the explanation on GitHub instead of just linking to an external webpage describing it. This hopefully makes it easier to understand.
LibreOffice extension: https://extensions.libreoffice.org/en/extensions/show/27369
Firefox addon: https://addons.mozilla.org/en-US/firefox/addon/the-taxonomy-dictionary/

I do think that the general idea is a good one, and that a fully-automated tool that downloads current data from the appropriate resources and compiles terms into a corresponding database/dictionary would be a publishable resource worth sharing. However, my view is that in its current state neither the script nor the dictionary meet the claims made in the manuscript, or provide a reliable, reusable resource. I do think that this would be achievable with a limited amount of extra programming. I also think that the inclusion of a versioning scheme for the dictionary (even date-based versioning) would be an improvement, as it would allow users to know whether their copy of the dictionary was "current," and whether they should upgrade their local copy.

Please rate the manuscript for methodological rigour Poor
Please rate the quality of the presentation and structure of the manuscript Poor